Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse query comment and use as bigquery job labels. #3145

Merged
merged 10 commits into from
Mar 22, 2021

Conversation

jmcarp
Copy link
Contributor

@jmcarp jmcarp commented Mar 4, 2021

resolves #2483

Description

Try to parse query comment and pass values to bigquery job labels. Just a draft to see if I'm heading in the right direction. @jtcohen6

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Mar 4, 2021
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool @jmcarp!

Screen Shot 2021-03-04 at 12 33 39 PM

Thinking about how to test this:

  • A unit test could inspect job_params[labels] and confirms that they contain the keys from the default query comment (app, profile_name, target_name, dbt_version, node_id) and the dbt_invocation_id. Maybe the query comment-to-job label conversion should be a standalone function?
  • An integration test could run a model and check that the labels appropriately registered, by querying INFORMATION_SCHEMA.JOBS_BY_USER

@@ -215,6 +215,7 @@ def to_target_dict(self):
class QueryComment(dbtClassMixin):
comment: str = DEFAULT_QUERY_COMMENT
append: bool = False
job_label: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config is query-comment, rather than query_comment (docs). Not exactly sure why we did this, but I guess kebab casing is common in dbt_project.yml / project-level configs. I think this should be job-label (instead of job_label) for consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I switched the base class to HyphenatedDbtClassMixin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I may not understand what this base class does, since that change broke tests that I thought were unrelated. Reverted for now, but let me know if you have thoughts about the right way to do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another look at this, and I think I found a subtle bug in the casing logic. Should be fixed now.

Comment on lines 324 to 325
except (TypeError, ValueError):
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the comment isn't valid json, should we still include it as a single job label (called query_comment), truncated to 128 characters if need be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I changed it.

@jmcarp jmcarp force-pushed the jmcarp/bigquery-job-labels branch from 7b4b728 to 8d73ae2 Compare March 4, 2021 15:20
if cls._hyphenated:
# `data` might not be a dict, e.g. for `query_comment`, which accepts
# a dict or a string; only snake-case for dict values.
if cls._hyphenated and isinstance(data, dict):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this make sense @gshank?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at all the existing calls to this class method, it seems to be true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm not sure exactly how this code gets reached, but I dropped into a debugger at this point on the unit test suite and found non-dictionary values--specifically, there's a test that sets query-comment to "". You should be able to reproduce the issue by dropping the isinstance check that I added and running something like pytest -s test/unit/test_query_headers.py -k test_disable_query_comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also run that test with the following added to pre_deserialize:

        if cls._hyphenated and not isinstance(data, dict):
            import ipdb; ipdb.set_trace()

Looking at the stack, I'm inside a function called __unpack_union_Project_query_comment__9f5a8d0e89384cc286d2b99acef5623d that I'm guessing is dynamically generated and eval-ed by mashumaro 😱.

@jmcarp jmcarp marked this pull request as ready for review March 5, 2021 14:41
@jtcohen6 jtcohen6 requested review from a team, nathaniel-may and jtcohen6 and removed request for a team March 8, 2021 16:05
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good from my vantage! One small comment from my testing. I think it'd be good to add a few unit tests of the various inputs/outputs of query comment sanitizing and structuring. Beyond that, I think this is ready for a changelog entry (v0.20.0 Feature).

I also tagged a member of the Core team for python code review.

Comment on lines 594 to 602
_SANITIZE_LABEL_PATTERN = re.compile(r"[^a-z0-9_-]")


def _sanitize_label(value: str, max_length: int = 63) -> str:
"""Return a legal value for a BigQuery label."""
value = value.lower()
value = _SANITIZE_LABEL_PATTERN.sub("_", value)
value = value[: max_length - 1]
return value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anecdotally, this has the effect of appending an extra underscore at the end of my non-dict comment, e.g. comment: whatever in dbt_project.yml becomes query_comment: whatever_ in BigQuery, and comment: something else becomes query_comment: something_else_. Is that intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this is happening--there may be something upstream of this code adding trailing whitespace, because I don't get that behavior testing the function directly. I tried adding a strip(). Does that help?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the strip() fixed it!

Screen Shot 2021-03-09 at 10 39 33 AM

(The numerous _ in the top example are all special characters.)

@jmcarp jmcarp force-pushed the jmcarp/bigquery-job-labels branch from f3f2f72 to af3c3f4 Compare March 8, 2021 20:44
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome @jmcarp! Thanks for adding those unit tests.

Functional review is good by me: I've confirmed this works as expected when job-label: True (with default, custom JSON, and custom string comments), and just as crucially that it works as expected when job-label: False or not supplied (the only label is dbt_invocation_id = status quo).

I'll leave final code/style review to @nathaniel-may. After that, this is good to go.

@jmcarp
Copy link
Contributor Author

jmcarp commented Mar 16, 2021

No rush @jtcohen6, but it would be awesome to get this into the next release.

Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was waiting on me @jmcarp. Thank you for your patience. Once we address my relatively minor questions here, this looks good to merge.

I do want an integration test so that we can detect regressions on this behavior from both dbt and bigquery, but I'm ok ticketing that in a separate PR if you are too @jtcohen6.

if cls._hyphenated:
# `data` might not be a dict, e.g. for `query_comment`, which accepts
# a dict or a string; only snake-case for dict values.
if cls._hyphenated and isinstance(data, dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at all the existing calls to this class method, it seems to be true.

@@ -931,3 +941,16 @@ def test_convert_time_type(self):
expected = ['time', 'time', 'time']
for col_idx, expect in enumerate(expected):
assert BigQueryAdapter.convert_time_type(agate_table, col_idx) == expect


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to use this style of test very often, but I actually really like it so I'm happy to introduce it to this file too.

from dbt.adapters.base.query_headers import MacroQueryStringSetter
from dbt.clients import agate_helper
import dbt.exceptions
from dbt.logger import GLOBAL_LOGGER as logger # noqa
from dbt.context.providers import RuntimeConfigObject

import pytest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't block the merge on this, but I'd rather put this import up just above import unittest.

"""Return a legal value for a BigQuery label."""
value = value.strip().lower()
value = _SANITIZE_LABEL_PATTERN.sub("_", value)
value = value[: max_length]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the behavior we want for long labels? Alternatives include raising our own error, or sending something we know is too long to big query and raising that db exception. As much as this will not often, if ever, be attempted on purpose, a typo could cause this which might be a slightly annoying user experience.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think passing through the input and letting bigquery crash makes sense.

@jmcarp jmcarp force-pushed the jmcarp/bigquery-job-labels branch from b0016ee to 044a6c6 Compare March 20, 2021 05:00
@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 22, 2021

Thanks for the review @nathaniel-may, and for the quick cleanup @jmcarp! I'll open a separate issue for the integration test (#3184).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support BigQuery Job Tags and Labels
3 participants